Walkthrough

Column

INTRODUCTION

This pipeline walkthrough is for fastq ITS2 data, and is coded in R.

Included within this document are six sections as seen above this text in the navigation bar.

The Walkthrough tab (this one!) provides a step-by-step walkthrough of the pipeline, including common errors. All lines of code are numbered to easily find where you are in the pipeline.

The Flowchart tab is a brief overview of the pipeline, including what outputs are given. The flowchart matches the navigation bar on the left, so it’s easy to find steps you’re looking for :)

The Variables to be Changed tab includes all lines of code within the pipeline that may vary based on your data/file directories: list of changes that may need to be made based on your data.

The Troubleshooting tab contains common errors that may arise .

The Outputs tab provides a list of outputs from this pipeline, as well as descriptions as to what information is within them.

The Server tab is a guide for moving onto the server (pipeline and files), as well as moving files off the server. If you are just beginning to use ARC, instructions and available files that will allow the pipeline to run smoothly are here!

The Considerations tab is a list of things to consider when analyzing data.

The Graphing tab includes how to create a basic interactive barplot that allows for the easy reviewing of data.

To the left of every page is a navigation bar: this allows you to quickly find any part of the pipeline/information you need.

Any lines of code with a ‘#’ in front of it will not run.

Files Needed

Ensure you know where all the following files are located on your computer (file directories):

Samples - these will be fastq.gz files. Often named as such: 1_1S_L001_R1_001.fastq.gz for the forward reads and 1_1S_L001_R2_001.fastq.gz for the reverse reads. Unfortunately, as of now samples that do not make it through the filtering steps of the pipeline will need to be removed manually (more on this in the filtering sections of the walkthrough).

Database - find the Nematode ITS2 database for dada2 here: https://www.nemabiome.ca/its2-database.html You will find it under the Analysis tab, called ITS2 Database. The dada2 format works for IdTaxa and AssignTaxonomy, the two classification algorithms that use the database.

R file - this is the pipeline. It will be called FinalPipelineSERVER.R for the server version, and FinalPipelineLOCAL.R for a local version.

Packages

These are packages this pipeline needs. They contain various functions used within the code.
If there is a problem with loading packages, check the Troubleshooting tab.

5   library(DECIPHER)
6   library(dada2)                        
7   library(ShortRead)                    
8   library(Biostrings)
9   library(ggplot2)
10  library(phyloseq)
11  library(dplyr)
12  library(reshape2)
13  library(GenomeInfoDbData)
14  library(openxlsx)
15  library(BiocManager)
16  library(stringr)


Setting Path to Samples

Here, you are telling R where your samples are located. This directory will need to be changed depending on where your samples are on the server/locally. The Variables to be Changed tab contains all directory changes in a more concise manner.

23  path <- "/home/jill.derijke/Test10Gdeer" # server

The “pattern” below is the end of your sample file names - this pattern must match your samples! With this pattern forward and reverse reads are distinguished (in this case R1 = forward reads and R2 = reverse reads).

25  fnFs <-sort(list.files(path, pattern = "_R1_001.fastq.gz", full.names = TRUE)) 
26  fnRs <-sort(list.files(path, pattern = "_R2_001.fastq.gz", full.names = TRUE))


Locating Primers + Pre-Filtering

The lines below will need to be changed if the primers used for your data are different.

33  FWD <-"ACGTCTGGTTCAGGGTTGTT" # NC1 primer
34  REV <- "TTAGTTTCTTTTCCTCCGCT" # NC2 primer


Now that the sequence of the primers used are identified, the code below allows these primers to be found in all orientations.

36  allOrients<-function(primer) {  
37   require(Biostrings)
38   dna<-DNAString(primer) 
39   orients<-c(Forward=dna, Complement=complement(dna), Reverse=reverse(dna),
40             RevComp=reverseComplement(dna))
41   return(sapply(orients, toString)) #now converting back into character vector
42  }
43  FWD.orients <- allOrients(FWD)
44  REV.orients <- allOrients(REV)


If run locally, running the below lines will show the primer sequence in all orientations.
If run on the server, this can be seen by viewing the “test_0000000.out” output file.

46  FWD.orients # to view output
47  REV.orients


Lines 50 and 51 create a file for all filtered reads, which will be outputted in the same file your samples are in, called “filtN”.

50  fnFs.filtN <- file.path(path, "filtN", basename(fnFs)) 
51  fnRs.filtN <- file.path(path, "filtN", basename(fnRs))



The line below is the pre-filtering step: it removes Ns from the sequences to allow for the removal of primers.

52  filterAndTrim(fnFs, fnFs.filtN, fnRs, fnRs.filtN, maxN = 0, multithread = 12) # pre-filtering to remove Ns


Now, the number of primers in all orientations will be counted with the code below.

#counting number of primers 
primerHits <- function(primer, fn) {
# Counts number of reads in which the primer is found
nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
return(sum(nhits > 0))
}
primer_counts <- rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.filtN[[1]]), 
                       FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs.filtN[[1]]), 
                       REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs.filtN[[1]]), 
                       REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs.filtN[[1]]))


The line below saves a count of the number of primers present in your samples as an rds file

saveRDS(primer_counts, "primer_countsBEFORE.rds")


Cutadapt - Primer Removal


The following block of code will remove primers from your reads.
You will need to change the directory for this external program to wherever it is on the server/computer.

# Load cutadapt
cutadapt <- "/home/jill.derijke/miniconda3/bin/cutadapt" # server // change to the cutadapt path on your machine
system2(cutadapt, args = "--version") # Run shell commands from R


The lines below create a file with all cutadapted samples, again found in the same file as your samples.

#now creating output filenames for cutadapted files, and define parameters to give cutadapt command
#critical parameters are the primers and they need to be in the right orientation

path.cut <- file.path(path, "cutadapt")
if(!dir.exists(path.cut)) dir.create(path.cut)
fnFs.cut <- file.path(path.cut, basename(fnFs))
fnRs.cut <- file.path(path.cut, basename(fnRs)) 

FWD.RC <- dada2:::rc(FWD) # these lines allow for the reverse-complement of the primer to be found
REV.RC <- dada2:::rc(REV)
# Trim FWD and the reverse-complement of REV off of R1 (forward reads)
R1.flags <- paste("-g", FWD, "-a", REV.RC) 
# Trim REV and the reverse-complement of FWD off of R2 (reverse reads)
R2.flags <- paste("-G", REV, "-A", FWD.RC) 


The following code contains the actual cutadapt command, which will remove primers from the reads. Parameters should not be changed here, as cutadapt is a little tricky to get working and the command line parameters are harder to understand. However, the documentation for cutadapt is linked in the Links tab if wanted (also to better understand the tool).

# Run Cutadapt
for(i in seq_along(fnFs)) {
  system2(cutadapt, args = c(R1.flags, R2.flags, "-n", 2, # -n 2 required to remove FWD and REV from reads
                             "-o", fnFs.cut[i], "-p", fnRs.cut[i], # output files
                             fnFs.filtN[i], fnRs.filtN[i])) # input files
}


After running cutadapt, primers should be removed! This can and should be checked by the lines below, which will search for primers in your reads. This information will be saved as an rds file as well which will be able to be imported back into R locally to view.

# remember to check things - can check presence of primers in the first cutadapted sample 
primercut_counts <- rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.cut[[1]]), 
                          FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs.cut[[1]]), 
                          REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs.cut[[1]]), 
                          REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs.cut[[1]]))

saveRDS(primer_counts, "primer_countsAFTER.rds")


The lines below sort forward and reverse cutadapted reads respectively, and keep them seperate to be further worked through the pipeline. If file names were changed earlier in the pipeline (see Setting Path to Samples), ensure they match the file names defined here!

# Forward and reverse fastq filenames have the format:
cutFs <- sort(list.files(path.cut, pattern = "_R1_001.fastq.gz", full.names = TRUE)) #remember to match file names 
cutRs <- sort(list.files(path.cut, pattern = "_R2_001.fastq.gz", full.names = TRUE))


Main Filtering

#filtering reads 
filtFs <- file.path(path.cut, "filtered", basename(cutFs))
filtRs <- file.path(path.cut, "filtered", basename(cutRs))


All of the parameters are default except for the maxEE parameter, which has been recommended to be more relaxed for reversed reads. A full explanation of all parameters found below can be found in the documentation, located within the Links tab. Briefly, maxN removes reads with more Ns than maxN specifies. The pipeline does not allow Ns to be included within the reads, so this is set to 0. maxEE specifies the maximum amount of expected errors allowed in forward and reverse reads respectively. Eg. 2 expected errors allowed in forward reads, and 5 allowed in reverse reads. The reverse read parameter has been relaxed due to the inherent worse quality of reverse reads. truncQ reads are truncated as soon as the quality becomes worse than the truncQ value specified. minLen removes reads that are shorter than the specified length (in this case 50) - after the other filtering steps are completed. rm.phix removes reads that match the phiX genome. There are many more parameters as can be found through viewing the documentation.

out <- filterAndTrim(cutFs, filtFs, cutRs, filtRs, maxN = 0, maxEE = c(2, 5), 
                     truncQ = 2, minLen = 50, rm.phix = TRUE, compress = TRUE, multithread = 12)  


The number of reads that made it into the filtering step and the number that made it out by sample can be seen in the final summary under the TrackReads tab. Depending on your samples this is an indicator of the quality of your reads.

The lines below list the names of your samples to be used later on. It obtains the sample name through splitting the full file name. For example, the sample file called 1_1S_L001_R1_001.fastq.gz will be split at the specified pattern “001_R” to become 1_1S_L.

# Extract sample names, assuming filenames have same format, may need to alter!!!
get.sample.name <- function(fname) strsplit(basename(fname), "001_R")[[1]][1]
sample.names <- unname(sapply(filtFs, get.sample.name))
sample.names


Below is the learning errors phase of the pipeline, where it learns the error rates of your samples. See the documentation for the LearnErrors function in the [Links tab] for more information!

#learning errors
errF <- learnErrors(filtFs, multithread = 12)
errR <- learnErrors(filtRs, multithread = 12)


The lines below save the results from learning errors to an rds file, which can be imported back into R and viewed locally. In the RDS_Graphing_FinalPipeline.R file, there will be code to enable this information to be seen as a plot.

saveRDS(errF, "errF.rds")
saveRDS(errR, "errR.rds")


These lines of code denoise the data, and ensure there are no replicate reads.

# denoising data 
derepFs <- derepFastq(filtFs, verbose = TRUE)
derepRs <- derepFastq(filtRs, verbose = TRUE)


Below sample inferencing occurs. The paper explaining the algorithm can be found in the Links tab. To my understanding, this resolves the reads to sequence variants.

#sample inference - at this step the core sample inference algoithm is applied to the dereplicated data 
dadaFs <- dada(derepFs, err = errF, multithread = 12)
dadaRs <- dada(derepRs, err = errR, multithread = 12)


Merge and Concatenate Reads

With ITS2 data, there is a possibility of the forward and reverse read not overlapping. With the normal dada2 pipeline, there is one function that merges OR concatenates forward and reverse reads, discarding those that do not overlap sufficiently or concatenating them all with 10 Ns between them. The code below combines these processes in a sequential manner so all reads that did not merge sufficiently are concatenated instead. A link to the source code and a little explanation from the writer is included in the Links section. A diagram is included below for added clarity.


Merge and Concatenate Diagram

The code below merges all reads 3 times:
1. mergedData: Using the most stringent merging parameter with 0 mismatches allowed in the overlap
2. maxMismatch: Using a relaxed merging parameter with 4 mismatches allowed in the overlap
3. concats: Concatenating the reads - merging them with 10 Ns

# prints files to those connections 
# maxMismatch=0 has higher penalty for mismatch and gap for nwalgin (-64)
cat("\n\nperfect match merging....\n")
mergedData <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, maxMismatch=0, returnRejects=TRUE, verbose=TRUE)
# maxMismatch > 0 has lower penalty for mismatch and gap for nwalgin (-8)
cat("\n\nmismatch allowed merging....\n")
mismatch_mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, maxMismatch=4, returnRejects=TRUE, verbose=TRUE)
cat("\n\nconcatenation merging....\n")
concats <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, justConcatenate=TRUE, verbose=TRUE)


The code below is not run (denoted by the # in front of each line). The reason it is not ‘activated’ is because it creates a file for each sample of all reads that were loosely merged, strictly merged, and concatenated. This means 3 files for every sample, which becomes a lot of files - turning 10 sample files into 10 + (3 x 10). It is advised to leave this so that it does not run unless needed/wanted or when running a few samples.

# # save sample merge dfs for diff comparisions
# sapply(names(mergedData), function (x) write.table(mergedData[[x]], file=paste(x, "strict_mergers.txt", sep="_"),
#                                                    sep="\t", row.names=FALSE, quote=FALSE) )
# 
# sapply(names(mismatch_mergers), function (x) write.table(mismatch_mergers[[x]], file = paste(x, "loose_mergers.txt", sep="_"),
#                                                          sep="\t", row.names=FALSE, quote=FALSE) )
# 
# # converting concat df to matrix to overcome error of - unimplemented type 'list' in 'EncodeElement' 
# sapply(names(concats), function (x) write.table(as.matrix(concats[[x]]), file=paste(x, "all_concats.txt", sep="_"),
#                                                 sep="\t", row.names=FALSE, quote=FALSE) )


The code below seems to only keep rows that have content. These lines were not added by me but is needed for the rest of the merge and concatenate process to run smoothly.

# TEST --> If the row has 0 rows, then remove it from list?  
mergedData_test <- mergedData[sapply(mergedData, function(x) dim(x)[1]) > 0]
mismatch_mergers_test <- mismatch_mergers[sapply(mismatch_mergers, function(x) dim(x)[1]) > 0]
concats_test <- concats[sapply(concats, function(x) dim(x)[1]) > 0]


In the lines below, the code creates a “rowsToDelete” vector so - for example - ASVs that are concatenated that merged perfectly can be dropped later (as we only want to keep the best merging of the forward and reverse read).

  # replace mismatched or concatenated ASVs in the main mergedData
  for(i in names(mergedData_test)) {
  # store row index to drop certain ASVs later
  rowsToDelete = vector()


The lines below create dataframes of the 3 versions of merged/concatenated reads, creating mergedDf, mismatchDf, and concatDf.

  mergedDf = mergedData_test[[i]]
  cat(i, "Out of total", sum(mergedDf$abundance), "paired-reads (in", nrow(mergedDf), "unique pairings), retained ")
  mismatchDf = mismatch_mergers_test[[i]]
  rownames(mismatchDf) = paste(mismatchDf$forward, mismatchDf$reverse, sep = "_")
  concatDf = concats_test[[i]]
  rownames(concatDf) = paste(concatDf$forward, concatDf$reverse, sep = "_")


Below, any rows in the mergedDf (which are attempted perfectly merged reads) that are accepted due to a good merge will be skipped, or accepted as merged reads.

  for (row in 1:nrow(mergedDf)) {
    # skipping rows that are good to go from default analysis
    if (mergedDf[row,]$accept) { next } 

    uniquePairID = paste(mergedDf[row,]$forward, mergedDf[row,]$reverse, sep="_")


If the code detects a match length within the mergedDf that is less than 12, then the overlap is not sufficient and those reads will be replaced with the results of those reads from the concatDf (which contains only concatenated reads).

    # if match length is less than 12 then good to go with concatenation
    if (mismatchDf[uniquePairID,]$nmatch <= 12) {
      mergedDf[row,] = concatDf[uniquePairID,]
    }


ASVs with a long overlap and many mismatches are ignored however, which reduces bias due to a longer overlap between forward and reverse reads naturally increasing the probability of mismatches present within the lengthy overlap. The code does this in “ranks”, so not to just ignore the same number of mismatches in various lengths of overlap.

    # ignoring ASVs with many mismatches with long overlap
    else{
      misMatchIndels = mismatchDf[row,]$nmismatch + mismatchDf[row,]$nindel
      cutOff = 0


The line below enforces a cutoff of 1 mismatch in a 50 bp or less overlap.

      if ( mismatchDf[uniquePairID,]$nmatch > 12 && mismatchDf[uniquePairID,]$nmatch <= 50 ) {
        cutOff = 1
      }


The line below enforces a cutoff of 2 mismatches in a 100 bp or less overlap.

      else if ( mismatchDf[uniquePairID,]$nmatch > 50 && mismatchDf[uniquePairID,]$nmatch <= 100 ) {
        cutOff = 2
      }


The line below enforces a cutoff of 3 mismatches in an overlap greater than 100 bp.

      else if (mismatchDf[uniquePairID,]$nmatch > 100) {
        cutOff = 3
      }
      # check if mismatches are below cut off
      if (misMatchIndels <= cutOff) {
        mergedDf[row,] = mismatchDf[uniquePairID,]
      } 
      # discard if mismatches are high in longer than 100bp match 
      else if (cutOff==3 && misMatchIndels > 3) {
        rowsToDelete = c(rowsToDelete, row)
      }
      


The code below will concatenate reads if the number of mismatches is too high in the overlap region. However, the overlap will still be on the ends of the reads (eg. if the end of the forward read has the sequence ATATATATAT, the reverse read will have this same sequence at the beginning of it if they overlap). Thus, the overlap region of the reverse read is trimmed so they are concatenated without the overlap region being present twice.

      # concatenate if mismatches are high in overlap region remove reverse read part of the overlap region 
      else {
        trimLength = mismatchDf[uniquePairID,]$nmatch
        concatDf[uniquePairID,]$sequence = gsub(paste0("N{10}[A-z]{", trimLength, "}"), "NNNNNNNNNN", concatDf[uniquePairID,]$sequence)
        mergedDf[row,] = concatDf[uniquePairID,]
      }
    }
  }


The lines below delete the “rowsToDelete” vector that was created earlier.

  if (length(rowsToDelete) == 0){
    mergedData_test[[i]] = mergedDf
  } else {
    mergedData_test[[i]] = mergedDf[-rowsToDelete,]
  }
  cat(sum(mergedData_test[[i]]$abundance), "paired-reads (in", nrow(mergedData_test[[i]]), "unique pairings)\n")
}


At this point, reads are merged and/or concatenated to the best of their ability, with a very minimal number of reads lost! The number of reads lost at this step will be able to be viewed in the TrackReads tab of the final summary - not many should be lost. The lines below save lists of the different merging parameter for each ASV as rds files, to be imported back into R and viewed locally. The ASV sequence, abundance of the ASV, number of mismatches, and whether or not the ASV was accepted as a good merge for the respective parameters are listed by sample, as well as other information.

saveRDS(mergedData_test, "mergedData_test.rds")
saveRDS(mismatch_mergers_test, "mismatch_mergers_test.rds")
saveRDS(concats_test, "concats_test.rds")


Sequence Table

With the ASVs ready, the lines below create an Amplicon Sequence Variant table and attempt to remove chimeras.

#can now construct a sequence table - Amplicon Sequence Variant Table (ASV), a higher-resolution version of the OTU table
seqtab <- makeSequenceTable(mergedData_test)

#Remove chimeras 
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=12, verbose=TRUE)


The distribution of lengths of the ASVs can be viewed within the test_000000.out file on the server.

# inspect distribution of sequence lengths
table(nchar(getSequences(seqtab.nochim)))


Tracking Reads

It is important to view the output of this step - it tracks the reads through the pipeline, and tells you how many reads were lost at each point - including all filtering stages. Depending on your data, if too many reads are being lost at the filtering stage parameters may need to be loosened. The information from these lines is in the final summary excel file of the pipeline!

#track reads through the pipeline - inspect the number of reads that made it through each step in the pipeline -- IMPORTANT
getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergedData_test, getN), rowSums(seqtab.nochim))

# If processing a single sample, remove the sapply calls: e.g. replace
# sapply(dadaFs, getN) with getN(dadaFs)
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", 
                     "nonchim")
rownames(track) <- sample.names 

#TOTAL READS 
group <- sample(6)
colSums(track, group)   


Classification: IdTaxa

IdTaxa is a relatively new classification algorithm, which classifies the ASVs from the pipeline to the highest resolution it can using a database. A link to the documentation of IdTaxa is included within the Links tab.

The lines below do not run, as is denoted by the ‘#’ in front of the line. This line allows you to ‘set a seed’, specifying a random number within the brackets to ensure reproducibility if repeating the run. Occasionally, there are differences in resolution in the classifications which setting a seed supposedly helps. I am unsure if it works on the server, but I have left the code in just incase it is wanted to be looked into further.

#set seed when doing repeats - will keep result the same with repeats
# set.seed(123) # seed is a random number


The line below allows R to find the database to be used to classify sequences and reads it. The directory will need to be changed based on where your database file is located.

fasta <- "/home/jill.derijke/dada2.fasta"  # server
database <- readDNAStringSet(fasta)


The block of code below is taking the headers from the database reference sequences to create a taxonomy file IdTaxa will use to ‘learn’ the database. More information on these processes can be found in the documentation within the Links tab.

###### IDTAXA #####
#parse headers to obtain taxonomy
s <- strsplit(names(database), ";")
domain <-sapply(s, `[`, 1)
kingdom <-sapply(s, `[`, 2)
phylum <- sapply(s, `[`, 3)
class <- sapply(s, `[`, 4)
order <- sapply(s, `[`, 5)
family <- sapply(s, `[`, 6)
genus <- sapply(s, `[`, 7)
species <-sapply(s, `[`, 8)
taxonomy <- paste("Root", domain, kingdom, phylum, class, order, family, genus, species, sep=";")

taxonomat<- as.matrix(taxonomy)


In the lines below, the LearnTaxa function will train IdTaxa on the information within the database.

#training the classifier
trainingSet<-LearnTaxa(database, taxonomy)


An rds file saves the training set to be visualized locally.

saveRDS(trainingSet, "trainingSet.rds")


The lines below create an object called ‘test’. This is an arbitrary name, but contains the ASVs in a readable format to be read by IdTaxa for classification.

#creating dna string set from the ASVs (from dada2 tutorial) to use in IDTAXA as "test" (converting it to stringset object with same format)
test <- DNAStringSet(colnames(seqtab.nochim)) 


The line below ensures the ASVs in the ‘test’ object are named properly to ensure the ASVs can be traced back to their respective samples.

names(test) <- as.character(1:ncol(seqtab.nochim)) 


Below is the first classification part of the pipeline. IdTaxa classifies the ASV based on the database provided to it earlier. Parameters can be changed here, the most important one being threshold=60, which is the default. It may leave many ASVs classified only to the genus level depending on your data, however it is unadvised to loosen this parameter at the risk of sacrificing accuracy for resolution. Again, learn more about IdTaxa through documentation in the Links section.

##### PARAMETERS TO CHANGE ##### (line below)

# classifying 
ids <- IdTaxa(test, trainingSet, threshold = 60, bootstraps = 100, strand = "both", processors = NULL, verbose = TRUE, type="extended") 


The line below provides the number of ASVs, which can be a nice double check at the end of the pipeline that none have mysteriously gone missing.

length(ids) # - gives you number of ASVs


The “ids” object itself is a file which has all the confidence results for the ASVs. This information can also be found in the final excel sheet summary under IdTaxaConfidence, but is also saved as an rds file to be imported into R locally to view.

saveRDS(ids, "ids.rds")


The lines below organize the confidence values by rank.

idsFunstuff <- function(i){
  ids[[i]][['rank']]<-c("rootrank", "domain", "kingdom", "phylum", "class", "order", "family", "genus", "species")
  return(ids[i]) 
}
idsdone <- sapply(1:length(ids), idsFunstuff)


IdTaxa Classification Table

The dada2 pipeline uses specific matrices after classification that summarize the information very well and includes sample names and read abundances, whereas the IdTaxa classification method only outputs the “ids” object mentioned above. Hence, the IdTaxa output needs to be modified to a matrix better suited to the pipeline, as is done with the following code:

# Needed to convert back to analogous taxa matrix from dada2 pipeline tutorial 
Species <- sapply(idsdone,
              function(x) {
                w <- which(x$rank=="species") 
                if (length(w) != 1) {
                  "unknown"
                } else {
                  x$taxon[w]
                }
              })
table(Species)

Genus <- sapply(idsdone,
            function(x) {
              w <- which(x$rank=="genus") 
              if (length(w) != 1) {
                "unknown"
              } else {
                x$taxon[w]
              }
            })
table(Genus)

taxon <- sapply(ids,
            function(x)
              x$taxon[length(x$taxon)])

# Converting the output object of class "Taxa" to a matrix analogous to the output from assignTaxonomy

ranks<- c( "domain", "kingdom", "phylum", "class", "order", "family", "genus", "species")
taxid <- t(sapply(idsdone, function(y) {
  m <- match(ranks, y$rank)
  taxa<- y$taxon[m]
  taxa
}))
colnames(taxid) <- ranks
rownames(taxid) <- getSequences(seqtab.nochim)

taxa<- taxid 
taxa.print <- taxa  # Removing sequence rownames for display only
rownames(taxa.print)<-NULL

The lines below create a dataframe that is only names of the samples derived from the file names to eventually use further down the pipeline. IMPORTANT: If there are duplicate names, R will merge them - ensure they are unique!!.

## constructing df of sample data (not using any variables other than x kind of sample)
## if there are duplicate names, R will merge them so make sure they're unique
samplenam<- rownames(seqtab.nochim)

The first line below can be altered based on your data, but ensure the sample name is obtained at a point that will ensure no sample names are the same. For example, if I specify the file name to split at “_S" for two files that are named 193*_S193*_L001_R1_001.fastq.gz and 193*_S342*_L001_R1_001.fastq.gz, I will end up with sample names that are 193 and 193, even though the samples are different. The information for these files will merge, and will break the rest of the pipeline. As of now, the sample name is split from “001_R”, which is typically what works for the fastq files inputted. The other two lines create and format the sample name dataframe properly.

nameSample <-sapply(strsplit(samplenam, "001_R"), `[`, 1)
dfagain<-data.frame(Samples=nameSample) 
rownames(dfagain)<-samplenam

The phyloseq package is an R package that is created to analyze microbiome data. It is used here to format data in more user-friendly and clear way. More information will be available within the Links tab. Once properly organized, it is converted into a dataframe for easier use/manipulation.

phyABC<-phyloseq(otu_table(seqtab.nochim, taxa_are_rows = FALSE), sample_data(dfagain), tax_table(taxa))

physp <- tax_glom(phyABC, "species") # collapses them properly 
classif <- psmelt(physp) #converts into dataframe 
colnames(classif)[1]<- c("ASV")
classif$ASV <- NULL
classif$Sample <- NULL

### ORGANIZING DATA INTO TABLE 

AllInfoClassif <- tax_table(phyABC)@.Data

seqTable <- as.data.frame(t(seqtab.nochim))

SeqTabAndClassifiedInfo <- cbind(AllInfoClassif, seqTable)

Below, a column is added listing the ASVs.

## add column with ASV #
SeqTabAndClassifiedInfo$ASV <- c("ASV")
SeqTabAndClassifiedInfo$ASV <- paste0("ASV_", seq.int(nrow(SeqTabAndClassifiedInfo)))

The line below adds another column where the ASV and classification are jointly listed. This is for the purpose of graphing and to ensure the ASV and classification matching did not somehow get messed up further down the pipeline.

## getting header to include ASV and species name - adding column with both
SeqTabAndClassifiedInfo$ASVnumber <- paste(SeqTabAndClassifiedInfo$species, SeqTabAndClassifiedInfo$ASV, sep = "-")


Classification: AssignTaxonomy

This is the second classification algorithm that will classify ASVs within this pipeline. More information on AssignTaxonomy can be found in the Links tab, but briefly AssignTaxonomy is what originally came with the Dada2 pipeline, and after testing and various comparisons with IdTaxa there are multiple observations. AssignTaxonomy is less conservative than IdTaxa, and sometimes makes assumptions, often sacrificing accuracy for resolution. As well as this, less information is provided about the classifications. There is also less wiggle room in terms of parameters that can be altered. It has been left as part of the pipeline as it can be useful for a second check on top of BLAST.

ASSIGNTAXAtaxa <- assignTaxonomy(seqtab.nochim, fasta, multithread = 12, tryRC = TRUE, taxLevels = c("domain","kingdom", "phylum","class", "order","family", "genus", "species"))

The phyloseq package is once again used for data formatting.

phyAssignTaxa<-phyloseq(otu_table(seqtab.nochim, taxa_are_rows = FALSE), sample_data(dfagain), tax_table(ASSIGNTAXAtaxa))

AssignTaxAllInfo <- tax_table(phyAssignTaxa)@.Data

The lines below summarize how many reads were classified to the highest resolution (species) by each classification algorithm.

AssignedToSpAT <- sample_sums(tax_glom(phyAssignTaxa, "species")) # summary of reads to check
AssignedToSpIdTax <- sample_sums(tax_glom(phyABC, "species")) # summary of reads to check

The lines below further format data for the final summary, including combining dataframes and including columns containing ASV and classification information.

ATSeqTabAndClassifiedInfo <- cbind(AssignTaxAllInfo, seqTable)

## add column with ASV #
ATSeqTabAndClassifiedInfo$ASV <- c("ASV")
ATSeqTabAndClassifiedInfo$ASV <- paste0("ASV_", seq.int(nrow(ATSeqTabAndClassifiedInfo)))

## getting header to include ASV and species name - adding column with both
ATSeqTabAndClassifiedInfo$ASVnumber <- paste(ATSeqTabAndClassifiedInfo$species, ATSeqTabAndClassifiedInfo$ASV, sep = "-")

Creating Fasta File of ASVs

This block of code creates a fasta file of the ASVs which will be used by BLAST.

##### Make FASTA file of ASVs #####
# Should be same regardless of IdTaxa or assigntaxonomy.

Below, sample names are extracted from file names. Ensure no two samples are named the same when specifying a pattern to split the file name by. The IdTaxa Classification Table explains this topic further.

samplenames <- sapply(strsplit(basename(filtFs), "_L001_R"), '[', 1) 

The code below formats the ASV sequence table into a fasta file.

mname <- "taxid"

getseq <- function(i){
  colnames(seqtab.nochim)
}
asv_seqs <- lapply(1, getseq)
asv_seqs <- unlist(asv_seqs[[1]]) 

getheads <- function(reg){
  asv_head <- vector(dim(seqtab.nochim)[2], mode="character")
  for (i in 1:length(asv_head)) {
  asv_head[i] <- paste0(">ASV_", i)
  }
  return(asv_head)
}
asv_headers <- lapply(1, getheads)
asv_headers <- unlist(asv_headers)

asv_fasta <- c(rbind(asv_headers, asv_seqs))

The line below saves the fasta file.

write(asv_fasta, file = "ASV.fasta", sep = "/")

BLAST

This is the final classification method of the pipeline. The ASVs are BLASTed through NCBI. There are important directories that must be changed here!!

# Same regardless of using IdTaxa or AssignTaxonomy
# Import blastn python package*

Change the path to this to tell R where the blastn file is. If using the server, the path should be the same other than the username.

blastn <- "/home/jill.derijke/miniconda3/bin/blastn" # server

The fasta file from before is now imported back into the pipeline to be used by BLAST. Change the path to where it was written to. If the fasta file name was changed when writing the file, it must be changed here too.

# Load the ASV fasta file to be blastn-ed. Ensure directory is correct
FAS <- "/home/jill.derijke/ASV.fasta" # server

This is the home directory, and where the output file of BLAST will end up. It is important to know where these files are going!! Change the directory to where the file will go (if anything just change the username).

# Set location of output file of blast results
homedir <- "/home/jill.derijke/" # server

The code below is the actual BLAST process. It is slow!! Note the file “BlastResults.out”. The name for this file can be changed, but must be changed later on as well. It should be noted that the parameter max_target_seqs is not necessarily the top 5 hits. A paper highlighting this issue and its impacts on bioinformatic analysis is linked in the Links tab.

# blastn the ASVs ; TAKES A WHILE!!! Loosened blastn parameters of what it should return as a hit (i.e. keep everything)
system2(blastn, paste("-db nt -query ", FAS, "-out",
                      paste0(homedir,"BlastResults.out"), "-remote", "-max_target_seqs", 5, "-qcov_hsp_perc", 25, "-perc_identity", 0,
                      "-task blastn",
                      "-outfmt", "\"6 qseqid sseqid sscinames pident length mismatch gapopen qstart qend sstart send evalue bitscore score
                      qgi qacc qaccver qlen sallseqid sgi sallgi sacc saccver sallacc slen nident gaps frames qframe sframe staxids scomnames
                      sblastnames sskingdoms stitle salltitles sstrand qcovs\""))

# BLAST is run on the server. The above is what is used to do so (with modified inputs and outputs).

To import BLAST results for further use, import the output file back into the pipeline with the line below. This will need to be changed to the path to your file.

# Import & look at BLAST results + the outputs
blast_asv <- read.table(file = "/home/jill.derijke/BlastResults.out",
                        sep = "\t", header = FALSE, stringsAsFactors = FALSE, fill = TRUE) # server
                        

The lines below create a table format of the BLAST results that is compatible with the rest of the pipeline.

blast_asv_title <- select(blast_asv, "ASV" = V1, "BLAST seqid" = V2, "identity" = V4, "bitscore" = V13, "Assignment" = V33)
write.table(blast_asv_title, file = "BLASTResults.csv")

Formatting Data for Summary

The code within this section includes data cleanup for the final excel workbook containing all significant classification information. At this stage in the pipeline, the hits from BLAST are listed in column format, meaning 5 hits for one ASV are listed one below the other. Leaving the data formatted in this way creates a very long table, with (number of BLAST hits) x (number of ASVs) = number of rows. The number of rows will easily reach the 1000s, thus the table is reorganized to list the ASVs one by one and have the BLAST hits horizontally listed by ASV in a row.

The code below selects unique ASVs (eg ASV_1 and ASV_2) and collapses the results. In this way, the dataframe that was
ASV_1 blast hit 1
ASV_1 blast hit 2
ASV_1 blast hit 3

is now
ASV_1 blast hit 1 blast hit 2 blast hit 3

# Get number of unique ASVs, then collapse
u <- unique(blast_asv_title[,1])
for(i in u) {
  n <- dim(blast_asv_title[blast_asv_title$ASV == i, ])[1]
  print(1:n)
  blast_asv_title$n[blast_asv_title$ASV == i] <- 1:n
  print(blast_asv_title$n)
}

blast_asv_FINAL <- reshape(blast_asv_title, idvar = "ASV", timevar = "n", direction = "wide")


The line below takes the dataframe of classifications and combines it with the BLAST results by ASV, resulting in one large dataframe containing all classification information as well as BLAST hits corresponding to each ASV.

## Combining these two dataframes for classifications and BLAST 
FinalSummary <- merge(SeqTabAndClassifiedInfo, blast_asv_FINAL, by = "ASV", all=TRUE) 



The code below formats the object containing IdTaxa confidence results into data that is compatible for input into an excel sheet, as well as corresponding the information to respective ASVs and samples so it is all traceable. The end result is a dataframe organized by ASV that presents the confidence percentage for each taxon as is determined by IdTaxa. In theory, it is a simple organizational concept. However, a lot of code was needed to properly execute it.

##### Final summary of IdTaxa classification confidences #####
tryingC <- sapply(ids,
              function(x)
                paste(x$confidence,
                      collapse=";"))

tryingT <- sapply(ids,
                  function(x)
                    paste(x$taxon,
                          collapse=";"))

Tdf <- as.data.frame(tryingT)
Cdf<- as.data.frame(tryingC)

## Making columns 
stringr::str_split_fixed

Cdf<- str_split_fixed(Cdf$tryingC, ";", 9) ## works 

colnames(Cdf) <- c("root","domain", "kingdom", "phylum", "class", "order","family", "genus", "species")
rownames(Cdf) <- SeqTabAndClassifiedInfo$ASVnumber

Tdf <- str_split_fixed(Tdf$tryingT, ";", 9) ## works 

colnames(Tdf) <- c("root","domain", "kingdom", "phylum", "class", "order", "family", "genus", "species")
rownames(Tdf) <- SeqTabAndClassifiedInfo$ASVnumber

TDF <- as.data.frame(Tdf)
CDF <- as.data.frame(Cdf)
TestDF <- TDF

TestDF$ASV <- paste(SeqTabAndClassifiedInfo$ASV) ## changed
TestDF$root2 <- paste(TestDF$root, CDF$root, sep = "-") ## combining tax and confidence columns 
TestDF$domain2 <- paste(TestDF$domain, CDF$domain, sep = "-") ## combining tax and confidence columns 
TestDF$kingdom2 <- paste(TestDF$kingdom, CDF$kingdom, sep = "-") ## combining tax and confidence columns 
TestDF$phylum2 <- paste(TestDF$phylum, CDF$phylum, sep = "-") ## combining tax and confidence columns 
TestDF$class2 <- paste(TestDF$class, CDF$class, sep = "-") ## combining tax and confidence columns 
TestDF$order2 <- paste(TestDF$order, CDF$order, sep = "-") ## combining tax and confidence columns 
TestDF$family2 <- paste(TestDF$family, CDF$family, sep = "-") ## combining tax and confidence columns 
TestDF$genus2 <- paste(TestDF$genus, CDF$genus, sep = "-") ## combining tax and confidence columns 
TestDF$species2 <- paste(TestDF$species, CDF$species, sep = "-") ## combining tax and confidence columns 

FinalConfSum <- select(TestDF,-c(1:9))
colnames(FinalConfSum) <- c("ASV", "root","domain", "kingdom", "phylum", "class", "order", "family", "genus", "species")


The code below formats all information regarding read counts into one dataframe. Information is taken from three dataframes, each containing numerical data on reads including a sum after each major filtering step in the pipeline, the number of reads classified to the species level for AssignTaxonomy and IdTaxa respectively. This numerical data is combined into one dataframe to be included in the final summary.

# summary tracking reads through the pipeline 
tttt <- as.data.frame(track)
tttt1 <- tibble::rownames_to_column(tttt, var="Samples")

atat1 <- as.data.frame(AssignedToSpAT)
atat2 <- tibble::rownames_to_column(atat1, var="Samples")

idid <- as.data.frame(AssignedToSpIdTax)
idid1 <- tibble::rownames_to_column(idid, var="Samples")

TrackReadsClassif <- merge(atat2, idid1, all=TRUE, by = "Samples" )



The lines below ensure formatting for the final summary excel sheet is consistant, and adds a column at the beginning of the data frames with ASV numbers.

## including the ASV as first row of sheet 
ATSeqTabAndClassifiedInfo2 <- cbind(SeqTabAndClassifiedInfo$ASV, ATSeqTabAndClassifiedInfo)

SeqTabAndClassifiedInfo2 <- cbind(SeqTabAndClassifiedInfo$ASV, SeqTabAndClassifiedInfo)



The code below is creating a dataframe with all species classifications collapsed instead of by ASV. This way, it is easy to briefly review the species composition of all samples.

phyTEST <- tax_glom(phyABC, "species", NArm = FALSE) # collapses them properly 

classifTEST <- psmelt(phyTEST) #keep - converts into dataframe 
colnames(classifTEST)[1]<- c("ASV")
classifTEST$ASV <- NULL
classifTEST$Sample <- NULL

require(reshape2)
newdfTEST <- reshape(classifTEST, timevar = "Deer", idvar = c( "domain", "kingdom", "phylum", "class", "order", "family", "genus", "species"), direction = "wide")
names(newdfTEST) <- gsub(pattern = "Abundance.", replacement = "", x = names(newdfTEST), ignore.case = TRUE) #removing part of sample name that is not needed


Summarizing into Excel Workbook

This is the final step in the pipeline!! Now that all information is properly formatted, the code below converts it all into multiple excel sheets within one excel workbook for easy viewing! Each code section from here on is a sheet in the excel workbook.

wb <- createWorkbook() ## Save workbook


The sheet below includes collapsed IdTaxa classifications where IdTaxa classification results are collapsed by species.

# sheet of collapsed IdTaxa classifications (by species)
IdTaxCollapsedSheet <- addWorksheet(wb, sheetName = "CollapsedSpeciesIdTaxa", tabColour = "#800000")
writeData(wb, IdTaxCollapsedSheet, newdfTEST, colNames = TRUE, rowNames = FALSE)


The sheet below contains IdTaxa classifications by ASV.

# sheet of IdTaxa classifications 
IdTaxSheet <- addWorksheet(wb, sheetName = "IdTaxaClassification", tabColour = "#CA0000")
writeData(wb, IdTaxSheet, SeqTabAndClassifiedInfo2, colNames = TRUE, rowNames = FALSE)
colForcol <- createStyle(fgFill = "#CA0000" )
addStyle(wb, "IdTaxaClassification", cols = 1, rows=1:(nrow(SeqTabAndClassifiedInfo2)+1), style = colForcol)


The sheet below contains the confidence values of each taxonomy rank by ASV.

# IdTaxa classification confidence values sheet 
IdTaxConfidenceSheet <- addWorksheet(wb, sheetName = "IdTaxaConfidence", tabColour = "#FF0000")
writeData(wb, IdTaxConfidenceSheet, FinalConfSum, colNames = TRUE, rowNames = FALSE)
colForcol <- createStyle(fgFill = "#FF0000" )
addStyle(wb, "IdTaxaConfidence", cols = 1, rows=1:(nrow(FinalConfSum)+1), style = colForcol)


This sheet contains all BLAST results by ASV, with blast hits across a row.

# sheet of BLAST results (5 hits across a row)
blastSheet <- addWorksheet(wb, sheetName = "BLAST", tabColour =  "#FF2C2C")
writeData(wb, blastSheet, blast_asv_FINAL, colNames = TRUE, rowNames = FALSE)
colForcol <- createStyle(fgFill = "#FF2C2C" )
addStyle(wb, "BLAST", cols = 1, rows=1:(nrow(blast_asv_FINAL)+1), style = colForcol)


The sheet below combines IdTaxa and BLAST classification results into one large dataframe if that is preferred for easier graphing/viewing.

# # sheet with IdTaxa and BLAST results combined 
AllSheet <- addWorksheet(wb, sheetName = "All", tabColour = "#FF5C5C")
writeData(wb, AllSheet, FinalSummary, colNames = TRUE, rowNames = FALSE)
colForcol <- createStyle(fgFill = "#FF5C5C" )
addStyle(wb, "All", cols = 1, rows=1:(nrow(FinalSummary)+1), style = colForcol)


The sheet below includes all AssignTaxonomy classification results by ASV.

# sheet with assigntaxonomy results 
ATSheet <- addWorksheet(wb, sheetName = "AssignTaxonomy", tabColour =  "#FF8484")
writeData(wb, ATSheet, ATSeqTabAndClassifiedInfo2, colNames = TRUE, rowNames = FALSE)
colForcol <- createStyle(fgFill = "#FF8484")
addStyle(wb, "AssignTaxonomy", cols = 1, rows=1:(nrow(ATSeqTabAndClassifiedInfo2)+1), style = colForcol)


This sheet tracks reads through the pipeline by sample

## sheet tracking reads through the pipeline 
TrackSheet <- addWorksheet(wb, sheetName = "TrackReads", tabColour =  "#FFB3B3")
writeData(wb, TrackSheet, tttt1, colNames = TRUE, rowNames = FALSE)



The last sheet below contains how many reads were classified to the species level by AssignTaxonomy and IdTaxa respectively. AssignTaxonomy tends to have much less reads assigned to the species level.

## sheet with how many reads are classified to the species level
NumberReadsClassifiedSheet <- addWorksheet(wb, sheetName = "#ReadsClassified",tabColour =  "#FFCCCC")
writeData(wb, NumberReadsClassifiedSheet, TrackReadsClassif, colNames = TRUE, rowNames = FALSE)

InfoSheet <- addWorksheet(wb, sheetName = "Info")


The line below saves the workbook, completing the pipeline! The final excel sheet can easily be renamed within the quotes if wanted. Keep the .xlsx file extension however.

saveWorkbook(wb, file = "FinalSummary.xlsx", overwrite = TRUE)

:)

Flowchart

Column

Flowchart of the Dada2 Pipeline

Flowchart of the Dada2 Pipeline

Variables to be Changed

Contents

Contents

Changes that can be made throughout the pipeline have been highlighted within the walkthrough, but this is a shortcut as to what needs to be changed if you are already relatively familiar with the pipeline or just need to know what to change.


This section is divided into three sub-sections:


1. Directories to be Changed: This is the most important and correlates with the blue boxes within the Flowchart! All file paths to be changed throughout the pipeline will be noted here. The pipeline will not run unless the correct file paths are provided.

2. Changeable Parameters: This section contains all functions where parameters could be changed. It is highly advised to review documentation (found in the Links tab) for any parameters you want to change. The default pipeline has all parameters fitted to be ideal for ITS2 data, but of course not all data is the same. Reasons for the parameters chosen are mentioned within the Walkthrough.

3. File Name Changes: This is the least important section where changeable variables are considered. This section contains all instances of file naming, which can be changed if preferred.


4. Other: This section contains any other changes that can potentially be made to the pipeline, but that don’t fit into any of the above sections, such as primer sequences and file patterns. If your sample file names differ from the usual 193_S193_L001_R1_001.fastq.gz type format, this section will be important.


Column

Directories to be Changed


The file paths that will need to be changed are listed below. Only change what is within quotes. Some of these will remain the same after specifying for the first time (only the file path for the samples will be needed to be changed).


1. Setting Path to Samples - R needs to know where your samples are. On the server, your home directory is “/home/username/”. Locally, it is usually “~/”.

23  path <- "/home/jill.derijke/Samples" # server


2. Cutadapt - Primer Removal - Cutadapt is an external primer removal tool, and the path to it will need to be established. This path should remain the same apart from the username.

72  cutadapt <- "/home/jill.derijke/miniconda3/bin/cutadapt" # server // change to the cutadapt path on your machine


3. Classification: IdTaxa - The file path to the database must be specified here.

275 fasta <- "/home/jill.derijke/dada2.fasta"  # server


4. BLAST - Some of these changes are linked - if you have changed the file name in one line, you will have to change it in the other as well.

4.1 - Creating a fasta file, and importing it back into R. Ensure the files fasta files specified are named the same!

443 write(asv_fasta, file = "ASV.fasta", sep = "/")

...

456 FAS <- "/home/jill.derijke/ASV.fasta" # server

4.2 - Like cutadapt, BLAST is also a tool external to R, and where it is on your computer will have to be specified.

453 blastn <- "/home/jill.derijke/miniconda3/bin/blastn" # server

4.3 - The line below is the location of the output file of BLAST results. If directory is changed, you will have to update line 471 below it as well.

459 homedir <- "/home/jill.derijke/" # server

...

471 blast_asv <- read.table(file = "/home/jill.derijke/BlastResults.out",


Changeable Parameters


There are multiple functions throughout the pipeline where parameters can be changed, such as filtering and classification steps. The Walkthrough contains more information about what the functions do, and Links are provided to documentation expanding on parameters. If parameters are changed, ensure you know what you’re changing :) Be careful when changing parameters, and make note of it within the Info tab of the final summary excel sheet. The pipeline has not been tested with every parameter, and currently unknown issues may arise.



1. Main Filtering - The filtering parameters below are default, or have been recommended for ITS2 data (such as the maxEE parameter). If no data is making it through the pipeline, it may be of use to relax some of these parameters.

117 out <- filterAndTrim(cutFs, filtFs, cutRs, filtRs, maxN = 0, maxEE = c(2, 5), 
                         truncQ = 2, minLen = 50, rm.phix = TRUE, compress = TRUE, multithread = 12)  



2. Classification: IdTaxa - The parameters below are default. The threshold parameter is most relevant. This is the confidence threshold of each classification, meaning the classifier must be =>60% confident in its classification at each level of classification. Lowering this number may provide higher resolution (eg. more species level classifications), but risks accuracy.

305 ids <- IdTaxa(test, trainingSet, threshold = 60, bootstraps = 100, strand = "both", processors = NULL, verbose = TRUE, type="extended") 



3. Classification: AssignTaxonomy - There is not much that can be altered here. The main parameter is not shown below, but it is minBoot, and is set to 50 as a default. It is the minimum bootstrap confidence. More information can be found in the documentation within the Links tab.

397 ASSIGNTAXAtaxa <- assignTaxonomy(seqtab.nochim, fasta, multithread = 12, tryRC = TRUE, taxLevels = c("domain","kingdom", "phylum","class", "order","family", "genus", "species"))



4. BLAST - The code below calls for BLAST run externally from the pipeline. It uses the command line, and has very many parameters. Links to these are included, however the below works with the intention of not filtering results.

462 system2(blastn, paste("-db nt -query ", FAS, "-out",
                  paste0(homedir,"BlastResults.out"), "-remote", "-max_target_seqs", 5, "-qcov_hsp_perc", 25, "-perc_identity", 0,
                  "-task blastn",
                  "-outfmt", "\"6 qseqid sseqid sscinames pident length mismatch gapopen qstart qend sstart send evalue bitscore score
                  qgi qacc qaccver qlen sallseqid sgi sallgi sacc saccver sallacc slen nident gaps frames qframe sframe staxids scomnames
                  sblastnames sskingdoms stitle salltitles sstrand qcovs\""))

File Name Changes


You may want to change the names of various output files throughout the pipeline. Some of the output files are fed back into the pipeline, which will need to be considered when altering the names. rds, xlsx, and csv files can be renamed to whatever, as these are not used again. Only change what is within quotes


1. Creating Fasta File of ASVs - The first line shown below outputs a fasta file with the name “ASV.fasta”. The second line reads the fasta file back into the pipeline to be used by BLAST. Because of this, if the file name is changed in the first line, it must be changed in the second as well so they match.

443 write(asv_fasta, file = "ASV.fasta", sep = "/")

...

456 FAS <- "/home/jill.derijke/ASV.fasta" # server


2. BLAST - Within BLAST, the output file on line 463 called “BlastResults.out” can be changed if wanted to call it something else. However, line 471 reads the BLAST results back into the pipeline, so the file directory must also reflect this change.

462 system2(blastn, paste("-db nt -query ", FAS, "-out",
463                  paste0(homedir,"BlastResults.out"), "-remote", "-qcov_hsp_perc", 25, "-perc_identity", 0,
                  "-task blastn",
                  "-outfmt", "\"6 qseqid sseqid sscinames pident length mismatch gapopen qstart qend sstart send evalue bitscore score
                  qgi qacc qaccver qlen sallseqid sgi sallgi sacc saccver sallacc slen nident gaps frames qframe sframe staxids scomnames
                  sblastnames sskingdoms stitle salltitles sstrand qcovs\""))
                  
...

471 blast_asv <- read.table(file = "/home/jill.derijke/BlastResults.out",
                    sep = "\t", header = FALSE, stringsAsFactors = FALSE, fill = TRUE) # server

Other


Some aspects of the pipeline may need to be altered based on your data:

1. Setting Path to Samples and Cutadapt - Primer Removal - The lines below specify the file pattern that sorts the samples as forward (R1) and reverse reads (R2). This pattern is specific for files named 193_S193_L001_R1_001.fastq.gz and 193_S193_L001_R2_001.fastq.gz, and may need to be changed. Only change the pattern within the quotes.

25  fnFs <-sort(list.files(path, pattern = "_R1_001.fastq.gz", full.names = TRUE)) # always ensure this pattern matches your samples
26  fnRs <-sort(list.files(path, pattern = "_R2_001.fastq.gz", full.names = TRUE))

...

106 cutFs <- sort(list.files(path.cut, pattern = "_R1_001.fastq.gz", full.names = TRUE)) #remember to match file names 
107 cutRs <- sort(list.files(path.cut, pattern = "_R2_001.fastq.gz", full.names = TRUE))



2. Locating Primers + Pre-Filtering - The code below specifies the primer sequence used for the samples. If the primers used are not those below, the sequence within the quotes will need to be changed.

33  FWD <-"ACGTCTGGTTCAGGGTTGTT" # NC1 primer
34  REV <- "TTAGTTTCTTTTCCTCCGCT" # NC2 primer



3. IdTaxa Classification Table (ASVs) - The lines below obtain names of the samples derived from the file names to eventually use further down the pipeline. IMPORTANT: If there are duplicate names, R will merge them - ensure they are unique!!. The line below can be altered based on your data, but ensure the sample name is obtained at a point that will ensure no sample names are the same. For example, if I specify the file name to split at “_S" for two files that are named 193*_S193*_L001_R1_001.fastq.gz and 193*_S342*_L001_R1_001.fastq.gz, I will end up with sample names that are 193 and 193, even though the samples are different. The information for these files will merge, and will break the rest of the pipeline. As of now, the sample name is split from “001_R”, which is typically what works for the fastq files inputted.

366 nameSample <-sapply(strsplit(samplenam, "001_R"), `[`, 1)



4. IdTaxa Classification Table (ASVs) - The line below names your samples Samples, as is specified by what is in brackets of the first line below. This is rather arbitrary, as it doesn’t impact the final summary. However, if you are viewing objects in R and want to specify what the samples are such as naming them Deer or Cattle, two changes need to be made. If the line Samples=nameSample is changed to Deer=nameSample, then line 565 must be changed from “Samples” to “Deer” where the timevar = parameter is specified.

367 dfagain<-data.frame(Samples=nameSample)

...

565 newdfTEST <- reshape(classifTEST, timevar = "Samples", idvar = c( "domain", "kingdom", "phylum", "class", "order", "family", "genus", "species"), direction = "wide")


Troubleshooting

Column

To be Continued

  • google the error message

Outputs

Column

Note if you are using the server and are unfamiliar with how to move files on and off the server, the Server tab has a section on moving files!

FINALSUMMARY.xlsx - Final Excel Workbook Summary

This excel workbook includes 9 sheets:

  • CollapsedSpeciesIdTaxa: IdTaxa classification results collapsed and summed by species.

  • IdTaxaClassification: IdTaxa classification results by ASV.

  • IdTaxaConfidence: IdTaxa confidence values for each taxonomic rank by ASV.

  • BLAST: Results from BLAST by ASV.

  • All: IdTaxa classification results combined with BLAST results in one sheet by ASV (basically IdTaxaClassification + BLAST).

  • AssignTaxonomy: AssignTaxonomy classification results by ASV.

  • TrackReads: Reads tracked throughout each major filtering step in the pipeline, including how many were inputted, how many left after filtering, how many left after denoising the data, how many left after merging forward and reverse reads, and finally how many remain after removing chimeras. This last value is how many will be classified.

  • ReadsClassified: This sheet includes the total number of reads classified to the species level by AssignTaxonomy: AssignedToSpAT, and IdTaxa: AssignedToSpIdTax.

  • Info: This sheet has been purposely left empty. This is where any additional notes/information can be recorded, including the date, what parameters were changed from default, any supplementary files, database used, etc.

.rds Files

Files with the file extension .rds are files that can be imported back into R on your local computer to view locally.
If there are additional outputs of the pipeline you would like to be able to view, write the object out as an rds file with the line below, using the sequence table produced by the pipeline as an example:

saveRDS(seqtab, "SequenceTable.rds"))

Now you will have the rds file SequenceTable.rds in your home directory on the server. To view this locally, move this file onto your local computer, and read it into R with the line below:

seqtab <- readRDS("~/Desktop/SequenceTable.rds")

You will have to change the directory depending on where the file is located on your local computer!

locally

# use head() or class()  or View () to view diff. parts of info 

Server

Column

Switching from Synergy to ARC

If you have a synergy account and are switching to ARC, it is a relatively easy process! To request an ARC account, email support@hpc.ucalgary.ca. Subject to change due to possibility of group ARC account. Once you have an ARC account, login to the server using the line below (changing it to your username):

ssh jill.derijke@arc.ucalgary.ca

Whereas you need to access the vpn on synergy, ARC is accessible if you are connected to the schools wifi - an easier login.
To move files from your synergy account to your ARC account, use the line below while on synergy. Keep in mind: the ARC server has a space limit of 500GB. This should be enough space, but be aware of your use! rsync will need to be installed to use the below line (highly recommended for file moving).

rsync -a . jill.derijke@arc.ucalgary.ca:

This line moves everything from synergy to ARC.

Starting with a Fresh ARC Account

If you are new to using servers and just have an ARC account or don’t have the sufficient files to run the pipeline the easiest thing is to use an already existing conda environment which will contain everything needed to run R and efficiently move files etc. This will be available as EnvForPipeline.txt file on dropbox.
To reiterate, at this stage all you have is an empty ARC account. To get the conda environment, follow the steps below:
1. Follow this link https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html.
2. Go to Miniconda installer for Linux.
3. Scroll to Linux installer.
4. Click the most recent version available (64-bit). You will be prompted to save the file - all you need is the name of the file if it is updated from the one below. Replace the part of the code below specifying the version of Miniconda3 with the most updated.

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

5. Once it has saved, run the code below (ensure this code is updated too - it needs to match the version saved).

bash Miniconda3-latest-Linux-x86_64.sh

You will be prompted to hit enter, then answer yes or no possibly more than once - review, and answer yes if everything seems right. Once completed, the following message will appear:

For this change to become active, you have to open a new terminal.

Thank you for installing Miniconda3!

6. Open a new terminal, login to ARC, and move the EnvForPipeline.txt file onto your server. To do so, run the line below on a fresh terminal (not on the server), changing the directories to the correct paths for your file.

rsync -a ~/Desktop/EnvForPipeline.txt  jill.derijke@arc.ucalgary.ca:~

7. Once the EnvForPipeline.txt file is on the server, run the line below:

conda install --file EnvForPipeline.txt

You may get this error:

ParseError: Could not parse explicit URL: https://repo.anaconda.com/pkgs/main/linux-64/libstdcxx-ng-9.1.0-hdf63c60_0.conda

In which case you may need to update the conda installed. Even if just recently installed, for some reason it may not be the most updated version. Use the line below to update.

conda update conda -n root

Run the below line again

conda install --file EnvForPipeline.txt

Moving Files

ON A FRESH TERMINAL WINDOW: not logged into ARC

Below is the line of code that will move files from your local computer to the server. These are directories, so ensure you specify where everything is going and customize the lines to your account/computer.

rsync -a ~/FileOnLocalComputer jill.derijke@arc.ucalgary.ca:~  

Below is the line of code that will move files from the server to your local computer.

rsync -a jill.derijke@arc.ucalgary.ca:~/FileOnServer.xlsx ~/PlaceYouWantToMoveFileToOnLocalComputer

Useful Command Line Commands

The command below allows you to view everything in your working directory (which is usually home).

ls 

This command reads the file you apply it to, such as the test_000000.out or test_000000.err file. Once in the file ctrl + g will skip you to the bottom (most recent output).

less test_000000.out

The command below will print the directory you are currently in.

pwd 

The command below allows you to change directories within the server. The character ~ means home, and can be used to move you back to the home directory.

cd specify/directory/

Removing files (permanantly!) is done with the command below.

rm filename


Running Pipeline on Server

LOGGED ONTO ARC (THE SERVER)

Type the line below in the terminal when on ARC. A job script needs to be created to run the pipeline on the server, and the line below creates a textfile.

nano FinalPipelineSERVER.slurm

Copy and paste this into the empty .sh file
Make sure the server directory to the pipeline is correct - where it says Rscript below

#! /usr/bin/env bash

#SBATCH --job-name=FinalPipeline.job
#SBATCH --output=test_%j.out
#SBATCH --error=test_%j.err
#SBATCH --cpus-per-task=12
#SBATCH --time=06:00:00
#SBATCH --mem=0
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --partition=cpu2019
#SBATCH --kill-on-invalid-dep=yes

Rscript /home/jill.derijke/FinalPipelineSERVER.R

ctrl + o then Enter will write out the above to the .sh file
ctrl + x will exit out of the editing screen

The line below will submit your job to the server

sbatch FinalPipelineSERVER.slurm

The line below allows you to check the status of your submission

squeue -u jill.derijke

The line below will kill your job - make sure you use the correct job id and don’t kill someone else’s!!

scancel 1982480

Considerations

Column

To be Continued

  • ensure no files are named the same (R will use whichever and it may mess up your data)

  • change multithread to 12 – then use 12 CPUs on the server (this should already be done for you)

  • files need to be gzipped (?)

  • you may need to manually remove sample files that do not pass filtering

Graphing

Column

Contents

This section includes a brief explanation of the code used to create interactive stacked barplots for a plot with and without labels.

Hover Text

The interactive plots (seen below) are useful for multiple reasons. A significant one is the ability to view various pieces of information from different classification algorithms (BLAST and IdTaxa). This makes for a very easy visual comparison of any classification discrepencies instead of staring at spreadsheets. More information on interactive plots can be found linked in the Links tab.

Contents of the hover text:
* Sample: the sample this information comes from
* Abundance: abundance of reads mapping to this ASV/classification - relative and absolute depending on if the bar plots contain labels
* Species: the species classification from IdTaxa has
* ASV: the Amplicon Sequence Variant that was classified
* BLAST_hit: the result of BLASTing the ASV with blastn and NCBIs database

Example Graph

Code For Graph

The code below graphs data from the final excel summary sheet into an interactive stacked barplot.

Change the below lines where it says “~/5primerSamples.xlsx” to the directory of your final summary excel sheet (only what is within quotes):

blastgraph <- read.xlsx("~/5primerSamples.xlsx", sheet = "BLAST") ## using sheet with all blast # local (itd be on server)
blastgraph1 <- subset(blastgraph, select = c(ASV, Assignment.1)) # slecting the top hit

IdTaxgraph <- read.xlsx("~/5primerSamples.xlsx", sheet = "IdTaxaClassification")
IdTaxgraph1 <- subset(IdTaxgraph, select = -c(ASV, ASVnumber))


The lines below format the dataframe to include a first column of ASVs, and combine select information from the two excel sheets imported:

colnames(IdTaxgraph1)[1] <- c("ASV")

ToGraph <- join(IdTaxgraph1, blastgraph1, by = "ASV")


Below, the dataframe is further subsetted to only contain information relevent to graphing. This can be viewed through running the second line, which will open a new tab in R studio with the dataframe.

GraphingDF <- subset(ToGraph, select = -c(domain, kingdom, phylum, class, order, family, genus)) 
View(GraphingDF) ## makes the dataframe species, samples, ASV, Assignment.1


Below is more data formatting. Because the BLAST output is differently formatted than IdTaxas, the lines below take the first two words from the BLAST assignment (the species classification) and format it so it is more consistent with IdTaxa.

GraphingDF$Assignment.1 <- word(GraphingDF$Assignment.1, 1,2) ## taking first two words from blast assignment
GraphingDF$Assignment.1 <- gsub(' ', '_', GraphingDF$Assignment.1) ## adding underscore to make it consistent
GraphingDF2 <- melt(GraphingDF, id=c("species", "ASV", "Assignment.1"))


The lines below may be important to alter. The first line renames the columns of the dataframe to be graphed. What is within the quotes can be changed: for example, “Sample” can be changed to be more specific, such as “Deer” or “Cattle”. However, you will have to keep that consistant with wherever else the “Sample” column is used.
The second line in the code block below removes any unnecessary characters within the sample names. “\_S." will remove anything after the character _S. This can be changed, but leave the rest of what is within the quotes - only replace the _S part. In the case of sample files named 193_S193_L001_R1_001.fastq.gz, this line of code would change the names to be 193.
The third line in the code below removes characters not wanted in the sample name. For example, if all sample names started with an “X”, this line would remove it. Again this can be altered. Note where it says Sample, as in GraphingDF$Sample. If you change the Sample column name to something else, it will need to be changed here!!
The last line below changes all incidences of classifications as NA to Unclassified for clarity when plotting.

names(GraphingDF2) <- c("Species", "ASV", "BLAST_hit","Sample","Abundance")
GraphingDF2$Sample <- gsub("\\_S.*","",GraphingDF2$Sample) ## removing deer sample name (the _S part) -- to change based on samples
# GraphingDF2$Sample <- GraphingDF2$Sample %>% str_remove("X") ## to change based on samples
GraphingDF2[is.na(GraphingDF2)] <- c("Unclassified")


Below is what actually graphs the information!

The below line specifies axes and hovertext. If the x variable (Sample) has been changed, you will have to change it here as well! Otherwise R won’t know what you are specifying.

GraphingDF2plot <- ggplot(GraphingDF2[which(GraphingDF2$Abundance>0),], aes(x = Sample, y = Abundance, fill = Species, ASV = ASV, BLAST = BLAST_hit)) +


Below, the title is written within the quotes. Change this!

  ggtitle("Classified with IdTaxa - Hover Over Bars to Review BLAST Assignments etc") +


The line below dictates what the bars do within the barplot. They are filled by identity of the species to create a stacked bar by species composition.

  geom_bar(position = position_fill(reverse = TRUE), stat = "identity") +


The line below is what includes text on the bars, in this case the abundance of reads for each ASV. Further down there is a second example with this removed.

  geom_text(aes(label=Abundance), position=position_fill(reverse= TRUE, vjust = 0.5), size = 2, check_overlap = TRUE) +


These last two lines turn the sample name on the x axis 90 degrees for proper formatting, and feed the colours in for the graph. The colours are contained in an object called manualcolors3, which can be found at the beginning of the RDS_Graphing_FinalPipeline.R file.

  theme(axis.text.x = element_text(angle=90, hjust = 1))+
  scale_fill_manual(values = manualcolors3)


This line is the ggplotly function, which interactively graphs an otherwise stationary plot. Run ExampleGraph to view within R, otherwise save as an RDS to import into a flashboard like this one if wanted! It is also exportable as a web page in R, but won’t be sharable.

ExampleGraph <- ggplotly(GraphingDF2plot)

saveRDS(ExampleGraph, "GraphExample.rds")


Example Graph 2

Code for Graph 2


Most graphs will look like the one above! Colours have been edited to match what is usually used in the lab - colours for all other species have been basically randomly chosen. Compared to the first graph shown, the numbers showing the absolute abundance of reads for each ASV have been removed for clarity. This can be done by editing code to be the following:

GraphingDF2plot <- ggplot(GraphingDF2[which(GraphingDF2$Abundance>0),], aes(x = Sample, y = Abundance, fill = Species, ASV = ASV, BLAST = BLAST_hit)) +
  ggtitle("Classified with IdTaxa - Hover Over Bars to Review BLAST Assignments etc") +
  geom_bar(position = position_fill(reverse = TRUE), stat = "identity") +
  
  # the line below was not run, this removes the numbered read abundance labels 
  # geom_text(aes(label=Abundance), position=position_fill(reverse= TRUE, vjust = 0.5), size = 2, check_overlap = TRUE) +
  
  theme(axis.text.x = element_text(angle=90, hjust = 1))+
  scale_fill_manual(values = manualcolors3)